# Multimodal contrastive learning
| Model | License | Description | Task | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Eva02 Large Patch14 Clip 224.merged2b | MIT | EVA-CLIP vision-language model built on OpenCLIP with timm model weights, supporting zero-shot image classification. | Image Classification | timm | 165 | 0 |
| Eva02 Enormous Patch14 Clip 224.laion2b | MIT | EVA-CLIP vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Text-to-Image | timm | 38 | 0 |
| Vit H 14 CLIPA Datacomp1b | Apache-2.0 | CLIPA-v2 model, an efficient contrastive vision-language model designed for zero-shot image classification. | Text-to-Image | UCSC-VLAA | 65 | 1 |
| Vit H 14 CLIPA 336 Laion2b | Apache-2.0 | CLIPA-v2 model trained on the laion2B-en dataset, focused on zero-shot image classification. | Text-to-Image | UCSC-VLAA | 74 | 4 |
| Vit B 16 SigLIP | Apache-2.0 | SigLIP (Sigmoid Loss for Language-Image Pre-training) model trained on the WebLI dataset for zero-shot image classification. | Text-to-Image | timm | 27.77k | 31 |
| CLIP ViT B 32 Laion2b E16 | MIT | Vision-language pretrained model implemented with OpenCLIP, supporting zero-shot image classification. | Text-to-Image | justram | 89 | 0 |
| CLIP ViT L 14 CommonPool.XL S13b B90k | MIT | Vision-language pretrained model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval. | Text-to-Image | laion | 4,255 | 2 |
| CLIP ViT B 32 CommonPool.M.clip S128m B4k | MIT | Zero-shot image classification model based on the CLIP architecture, trained on a CLIP-filtered CommonPool subset. | Image-to-Text | laion | 164 | 0 |
| CLIP ViT B 32 CommonPool.S.laion S13m B4k | MIT | Vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Text-to-Image | laion | 58 | 0 |
| CLIP ViT B 32 CommonPool.S.image S13m B4k | MIT | Vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Text-to-Image | laion | 60 | 0 |
| Eva02 Large Patch14 Clip 224.merged2b S4b B131k | MIT | EVA02 large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Image Classification | timm | 5,696 | 6 |
| Vit Large Patch14 Clip 336.openai | Apache-2.0 | CLIP model developed by OpenAI with a ViT-L/14 architecture, supporting zero-shot image classification. | Text-to-Image | timm | 35.62k | 2 |
| CLIP ViT G 14 Laion2b S34b B88k | MIT | CLIP ViT-g/14 model trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval. | Text-to-Image | laion | 76.65k | 24 |
| Xclip Base Patch16 Zero Shot | MIT | X-CLIP is a minimal extension of CLIP for general video-language understanding, trained contrastively on (video, text) pairs; suited to zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval. | Text-to-Video (Transformers, English) | microsoft | 5,045 | 24 |
| Clip Vit Base Patch32 | | CLIP is a multimodal model developed by OpenAI that relates images and text, supporting zero-shot image classification. | Image-to-Text | openai | 14.0M | 666 |
| Clip Italian | GPL-3.0 | The first contrastive language-image pretraining model for Italian, built on Italian BERT and a ViT image encoder; reaches competitive performance with only 1.4 million fine-tuning examples. | Text-to-Image (Italian) | clip-italian | 960 | 16 |
| Clip Vit Large Patch14 | | CLIP is a vision-language model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, supporting zero-shot image classification. | Image-to-Text | openai | 44.7M | 1,710 |
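
Most of the image checkpoints above (the timm, laion, UCSC-VLAA, and justram entries) are distributed in OpenCLIP format, so zero-shot classification follows the same recipe regardless of backbone: encode the image and a set of label prompts, normalize both embeddings, and rank labels by cosine similarity. The sketch below assumes the `open_clip_torch` package and uses the `ViT-B-32` / `laion2b_e16` weights as a stand-in; other model/pretrained tag pairs from the table should be checked against `open_clip.list_pretrained()`.

```python
# Minimal zero-shot classification sketch with OpenCLIP (assumes open_clip_torch is installed).
import torch
import open_clip
from PIL import Image

# Model and pretrained-tag names are examples; verify with open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_e16")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical input image
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product becomes cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

The SigLIP entries are trained with a sigmoid rather than a softmax contrastive loss, but the same encode-and-compare flow applies at inference time.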
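For the video entry (Xclip Base Patch16 Zero Shot), the same contrastive idea applies to clips of frames rather than single images. Below is a hedged sketch using the Hugging Face Transformers `AutoProcessor`/`AutoModel` interface; the random frames, label strings, and the 32-frame clip length are placeholder assumptions that should be checked against the checkpoint's model card.

```python
# Zero-shot video classification sketch with X-CLIP (assumes transformers and torch are installed).
import numpy as np
import torch
from transformers import AutoModel, AutoProcessor

checkpoint = "microsoft/xclip-base-patch16-zero-shot"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

# Placeholder clip: a list of RGB frames (H, W, 3). Real use would sample frames from a
# video; the expected frame count is an assumption to verify against the model card.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))
labels = ["playing guitar", "cooking", "dancing"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video scores the clip against each candidate label.
probs = outputs.logits_per_video.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```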